Run-Time Parallelization: Its Time Has Come
Abstract
Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because these loops have complex or statically insufficiently defined access patterns. Such loops occur mostly in irregular, dynamic applications, which represent more than 50% of all applications [20]. The success of parallel computing has therefore become conditioned on the ability of compilers to analyze and extract the parallelism of irregular applications. In this paper we present a survey of techniques that complement current compiler capabilities by performing some form of data dependence analysis during program execution, when all information is available. After describing the problem of loop parallelization and its difficulties, we give a general overview of the need for run-time parallelization techniques and survey the various approaches to parallelizing partially parallel and fully parallel loops. Special emphasis is placed on two parallelism-enabling transformations, privatization and reduction parallelization, because of their proven efficiency. The technique of speculatively parallelizing doall loops is presented in more detail. The survey limits itself to the domain of Fortran applications parallelized mostly under the shared-memory paradigm. Related work from the fields of parallel debugging and parallel simulation is also described.

1 Automatic Parallelization

In recent years parallel computers have finally begun to move from university and national laboratories into the mainstream of commercial computing. This move has been driven by advances in technology (miniaturization), lower costs, and an ever increasing need for more computing power. Unfortunately, although the hardware for these systems has been built and is commercially available, exploiting its potential to solve large problems fast, i.e., to obtain scalable speedups, has remained an elusive goal because the software for these machines has not kept up with the technological progress. We believe that a necessary condition for parallel processing to truly become mainstream is to deliver sustainable performance (speedups across a wide variety of applications) while requiring an effort from the user no greater than that of sequential computing. We recognize three complementary avenues toward obtaining scalable speedups on parallel machines:

- good parallel algorithms, to create intrinsic parallelism in a program;
- development of a standard parallel language for portable parallel programming;
- restructuring compilers that optimize parallel code and parallelize sequential programs.

We believe parallel algorithms are absolutely essential for any program that is to be executed on a parallel system: it is not possible to obtain scalable concurrent execution from an application that employs inherently sequential methods. Once this is agreed upon, there are two different ways to express the parallelism: explicitly or implicitly. Explicitly written parallel programs can potentially produce good performance if the programmer is very good and the problem size (and difficulty) is manageable. Achieving this requires a parallel language that is both expressive and standardized. If we cannot clearly express the algorithmic parallelism then the effort is wasted, and if the language is not a recognized standard, then portability difficulties will make it uneconomical.
Additionally, we know from experience that, even for very well trained people, coding only with explicit parallelism can be significantly more difficult and time-consuming than programming in a sequential language. In particular, solving concurrency and data distribution issues is a difficult and error-prone task, which runs contrary to our principle that parallel and serial program development should require similar effort. Explicitly expressing all levels of parallelism (from instruction to task level) within the same language and hand-performing architecture-specific optimizations are additional difficulties in coding parallel programs.

Just as important is the fact that parallel systems do not run only newly written applications. There is an enormous body of existing software that must be ported to, and perform well on, these new systems. One solution is to rewrite these so-called 'legacy' programs [9], but this could prove prohibitively expensive. The alternative is to transform them automatically for concurrent execution by means of a restructuring, or parallelizing, compiler. Such a compiler should be able to safely (without errors) detect the available parallelism and transform the code into an explicitly parallel (machine) language. Most likely it would not be able to modify the algorithms used by the original programmer, so performance will be upper-bounded by the intrinsic, and limited, parallelism of the original application.

We therefore think that the ideal parallelizing compiler should be capable of transforming both 'legacy' code and modern code into a parallel form. New codes that use parallel algorithms can be written in an established, standard language like Fortran 77, in an explicitly parallel language, or, even simpler, in a sequential language with parallel assertions. Coding in such a hybrid language (a serial language with non-mandatory directives) should require the least effort: the programmer expresses as much knowledge about the code as he or she has and leaves the rest for automatic processing. Such a compiler must address two major issues: (a) parallelism detection and (b) parallelism exploitation.
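To make the problem concrete, consider a loop of the form A(W(i)) = A(R(i)) + c, where the index arrays W and R are computed or read in at run time: no compile-time test can decide whether two iterations touch the same element of A. The C sketch below is our illustration only; the function doall_test and the simplified condition it checks are assumptions made for exposition, not an algorithm taken from the paper. It shows an inspector-style run-time test that traverses only the access pattern and applies a conservative sufficient condition for the loop to be fully parallel: no element is written by two different iterations, and no element is read in one iteration but written in another.

    /* Illustrative sketch (not the paper's exact algorithm): a run-time
     * "doall" test for the loop
     *     do i = 1, n:  A(W(i)) = A(R(i)) + c
     * where the contents of W and R are unknown until run time. */
    #include <stdio.h>
    #include <stdlib.h>

    /* Returns 1 if the loop is fully parallel under a conservative
     * sufficient condition: no element is written twice, and no element
     * is read in one iteration but written in a different one.
     * (Error handling omitted for brevity.) */
    static int doall_test(const int *w, const int *r, int n, int m)
    {
        int ok = 1;
        /* writer[e] = iteration that writes element e, or -1 if none. */
        int *writer = malloc((size_t)m * sizeof *writer);
        for (int e = 0; e < m; e++)
            writer[e] = -1;

        /* Pass 1: writes. Two writes to one element -> output dependence. */
        for (int i = 0; i < n && ok; i++) {
            if (writer[w[i]] != -1)
                ok = 0;
            else
                writer[w[i]] = i;
        }
        /* Pass 2: reads. An element read in iteration i but written by a
         * different iteration -> cross-iteration flow or anti dependence. */
        for (int i = 0; i < n && ok; i++)
            if (writer[r[i]] != -1 && writer[r[i]] != i)
                ok = 0;

        free(writer);
        return ok;
    }

    int main(void)
    {
        enum { N = 6, M = 10 };
        /* In a real irregular code these index arrays would be computed or
         * read from input, which is exactly why static analysis fails. */
        int w[N] = { 0, 2, 4, 6, 8, 9 };
        int r[N] = { 1, 3, 5, 7, 1, 3 };

        if (doall_test(w, r, N, M))
            printf("no cross-iteration dependences: execute as a doall\n");
        else
            printf("dependence detected: execute sequentially\n");
        return 0;
    }

The speculative approach emphasized in the paper inverts this order: the loop is executed in parallel immediately while reads and writes are recorded in shadow structures, an analogous check runs afterward, and the loop is re-executed sequentially if a dependence is detected. The same kind of shadow information can also support the two transformations highlighted in the abstract: it can show that an array element is always written before it is read within each iteration (so the array can be privatized), or that it is only updated through an associative operation (so the loop can be parallelized as a reduction).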
Similar Papers
Worker-checker - A framework for run-time parallelization on multiprocessors
Run-time parallelization is a technique for solving problems whose data access patterns are difficult to analyze at compile time. In this paper we propose a worker-checker framework to classify existing run-time parallelization schemes. From the framework, several new approaches to run-time parallelization can be identified. The implementation of one such scheme, called the overlapped worker-then-...
Efficient parallelization of the genetic algorithm solution of traveling salesman problem on multi-core and many-core systems
Efficient parallelization of genetic algorithms (GAs) on state-of-the-art multi-threading or many-threading platforms is a challenge due to the difficulty of scheduling hardware resources with respect to the concurrency of threads. In this paper, to resolve the problem, a novel method is proposed, which parallelizes the GA by designing three concurrent kernels, each of which running some depe...
Run-time Parallelization Techniques for Sparse Applications
Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because they have complex or statically insufficiently defined access patterns. As parallelizable loops arise frequently in practice, we have introduced a novel framework for their identification: speculative parallelization. While we have previously shown that this method is inherently scalable, its p...
Techniques for Reducing the Overhead of Run-Time Parallelization
Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because they have complex or statically insufficiently defined access patterns. As parallelizable loops arise frequently in practice, we have introduced a novel framework for their identification: speculative parallelization. While we have previously shown that this method is inherently scalable, its p...
Final Report: Combining Interprocedural Compile-Time and Run-Time Parallelization
This report describes the final results of a project revolving around an integrated automatic parallelization system that combines high-quality interprocedural compile-time analysis with flexible run-time support. This research builds on our previous extensive work on interprocedural compile-time analysis for parallelization, sponsored by DARPA through contracts at Rice University and Stanford U...
Hardware for Speculative Run-Time Parallelization in Distributed Shared-Memory Multiprocessors
Run-time parallelization is often the only way to execute code in parallel when data dependence information is incomplete at compile time. This situation is common in many important applications. Unfortunately, known techniques for run-time parallelization are often computationally expensive or not general enough. To address this problem, we propose new hardware support for efficient run-time...
Journal: Parallel Computing
Volume: 24, Issue: -
Pages: -
Publication date: 1998